[day 15] Clustering and Similarity: Implementation

2018 iT 邦幫忙 Ironman Contest

DAY 16

AI & Machine Learning

Series "到底是在learning什麼拉" (What on earth are we learning?), part 16

Tags: 2018 Ironman Contest, machine learning

nylon

Team: 晶心壯士

2018-01-03 20:45:00

Loading & exploring Wikipedia data

Next, we want to use tf-idf to build a document retrieval system.

import graphlab

# load some text data from wiki pages on people
people = graphlab.SFrame('/home/user/nylon7/machine_learning/week4/people_wiki.gl')
people.head()

URI	name	text
http://dbpedia.org/resource/Digby_Morrell ...	Digby Morrell	digby morrell born 10 october 1979 is a former ...
http://dbpedia.org/resource/Alfred_J._Lewy ...	Alfred J. Lewy	alfred j lewy aka sandy lewy graduated from ...
http://dbpedia.org/resource/Harpdog_Brown ...	Harpdog Brown	harpdog brown is a singer and harmonica player who ...
http://dbpedia.org/resource/Franz_Rottensteiner ...	Franz Rottensteiner	franz rottensteiner born in waidmannsfeld lower ...
http://dbpedia.org/resource/G-Enka ...	G-Enka	henry krvits born 30 december 1974 in tallinn ...
http://dbpedia.org/resource/Sam_Henderson ...	Sam Henderson	sam henderson born october 18 1969 is an ...
http://dbpedia.org/resource/Aaron_LaCrate ...	Aaron LaCrate	aaron lacrate is an american music producer ...

Next, let's browse the data, starting with a closer look at one specific person.

# explore the dataset and checkout the text it contains
obama = people[people['name'] == 'Barack Obama']

obama['text']
# barack hussein obama ii brk husen bm born august 4 1961 is the 44th and current president of the united states and the first african american.....

Exploring word counts

# Get the word count for obama article 

obama['word_count'] = graphlab.text_analytics.count_words(obama['text'])
print obama['word_count']

# Sort the word count for obama article

# Turning dictionary of word counts into a table
obama_word_count_table = obama[['word_count']].stack('word_count', new_column_name = ['word','count'])

# Sorting the word counts to show most common words at the top
obama_word_count_table.sort('count',ascending=False)

From this table we can see that most of the top words carry no meaning on their own (the, in, and, of...). The one exception is obama.

word	count
the	40
in	30
and	21
of	18
to	14
his	11
obama	9
act	8
he	7
a	7
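The counting and sorting steps above can be sketched without GraphLab. This is a minimal Python 3 illustration using only the standard library; the `count_words` helper and the sample sentence are hypothetical stand-ins, not the GraphLab function or the real Wikipedia text.

```python
from collections import Counter

def count_words(text):
    """Return a Counter mapping each whitespace-separated token to its frequency."""
    return Counter(text.lower().split())

# Hypothetical mini-article, not the real Wikipedia text
sample = "the president signed the act and the senate passed the act"
counts = count_words(sample)

# Sort by count, most common first, like the sorted table above
for word, count in counts.most_common(3):
    print(word, count)
```

Even in this tiny sample the most frequent token is a function word ("the"), which is the same problem the Obama table shows.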

Computing & exploring TF-IDFs

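We cannot compute TF-IDF for the Obama article alone, because TF-IDF is defined over the whole corpus: each word's count has to be normalized by how widely that word appears across all the articles (the normalizer).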

# Compute TF-IDF for the corpus

# step 1:get word counts
people['word_count'] = graphlab.text_analytics.count_words(people['text'])

# step 2: calculate tf-idf (normalizes the counts across the corpus)
tfidf = graphlab.text_analytics.tf_idf(people['word_count'])
people['tfidf'] = tfidf
tfidf
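To make the normalization concrete, here is a small Python 3 sketch of one common TF-IDF formulation: the raw count multiplied by log(N / document frequency). The exact formula inside `graphlab.text_analytics.tf_idf` is an assumption here, and the `tf_idf` function and three-document corpus below are hypothetical.

```python
import math
from collections import Counter

def tf_idf(word_counts):
    """Weight each document's word counts by log(N / document frequency)."""
    n_docs = len(word_counts)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for counts in word_counts:
        doc_freq.update(counts.keys())
    return [
        {word: tf * math.log(n_docs / doc_freq[word]) for word, tf in counts.items()}
        for counts in word_counts
    ]

# Hypothetical three-document corpus
corpus = [
    "the president signed the act",
    "the footballer scored in the match",
    "the senate passed the act",
]
word_counts = [Counter(doc.split()) for doc in corpus]
weights = tf_idf(word_counts)

# "the" appears in every document, so log(3 / 3) = 0 removes it entirely
print(weights[0]["the"])
# Words unique to the first document get the largest weights
print(sorted(weights[0], key=weights[0].get, reverse=True)[:2])
```

This also shows why a single article is not enough: the document frequencies in `doc_freq` only exist once the whole corpus has been counted.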

Now we can examine the tf-idf values for the Obama article.

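# Re-select Obama so the row includes the new 'tfidf' column
obama = people[people['name'] == 'Barack Obama']
# Examine the TF-IDF for the Obama article
obama[['tfidf']].stack('tfidf', new_column_name=['word','tfidf']).sort('tfidf', ascending=False)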

word	tfidf
obama	43.2956530721
act	27.678222623
iraq	17.747378588
control	14.8870608452
law	14.7229357618
ordered	14.5333739509
military	13.1159327785
involvement	12.7843852412
response	12.7843852412
democratic	12.4106886973

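We did something similar earlier with raw word counts, where the only meaningful word was obama. This time the top words are clearly related to the subject of the article.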

Computing distances between Wikipedia articles

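First, let's look at how the distance between a target article and another article is expressed.

We pick Bill Clinton and David Beckham. In theory, Clinton should be closer to Obama.

The reason is that both are politicians, former U.S. presidents, and members of the Democratic Party.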

# Manually compute distances between a few people
clinton = people[people['name'] == 'Bill Clinton']
beckham = people[people['name'] == 'David Beckham']

# Is Obama closer to Clinton than to Beckham?
graphlab.distances.cosine(obama['tfidf'][0],clinton['tfidf'][0])
# 0.8339854936884276

graphlab.distances.cosine(obama['tfidf'][0],beckham['tfidf'][0])
# 0.9791305844747478

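A smaller number here means the two articles are closer, so the result matches our expectation: Clinton (0.834) is closer to Obama than Beckham (0.979).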
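For reference, cosine distance is one minus the cosine similarity of the two vectors. The Python 3 sketch below computes it for sparse vectors stored as dictionaries; the `cosine_distance` function and the three small vectors are hypothetical, and the sketch assumes neither vector is all zeros.

```python
import math

def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two sparse vectors stored as dicts."""
    # Only words present in both vectors contribute to the dot product
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical TF-IDF vectors
politician_a = {"president": 3.0, "senate": 2.0, "law": 1.0}
politician_b = {"president": 2.0, "law": 2.0, "governor": 1.0}
footballer = {"football": 3.0, "league": 2.0, "law": 0.5}

print(cosine_distance(politician_a, politician_b))
print(cosine_distance(politician_a, footballer))
```

Two vectors that share no words at all get a distance of 1, and the more heavily weighted words they share, the closer the distance gets to 0.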

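So far we have computed distances between hand-picked pairs of people. Next, I want to compute the distance between one article and every other article.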

Building & exploring a nearest neighbors model for Wikipedia articles

In earlier lessons we discussed the nearest neighbors model and how it can be used for document retrieval.

# Build a nearest neighbor model for document retrieval
knn_model = graphlab.nearest_neighbors.create(people,features=['tfidf'],label='name')

# Applying the nearest-neighbors model for retrieval
knn_model.query(obama)
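Conceptually, a nearest-neighbors query ranks every article by its distance to the query article. The Python 3 sketch below does this by brute force. It assumes cosine distance, which may not be the distance GraphLab selects by default, so it is not expected to reproduce the numbers in the tables that follow. The `query` function and the toy corpus are hypothetical.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two sparse vectors stored as dicts."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def query(corpus, label, k=3):
    """Rank every document by distance to the one named `label`; return the k closest."""
    target = corpus[label]
    ranked = sorted(
        (cosine_distance(target, vector), name) for name, vector in corpus.items()
    )
    return [(name, distance) for distance, name in ranked[:k]]

# Hypothetical TF-IDF vectors keyed by article name
corpus = {
    "Politician A": {"president": 3.0, "senate": 2.0, "law": 1.0},
    "Politician B": {"president": 2.0, "law": 2.0, "governor": 1.0},
    "Footballer": {"football": 3.0, "league": 2.0, "law": 0.5},
    "Actor": {"film": 3.0, "award": 2.0},
}

for name, distance in query(corpus, "Politician A"):
    print(name, round(distance, 3))
```

The query article always ranks first because its distance to itself is zero. Brute force compares against every document, which is fine for a sketch but is the part a real library would optimize.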

Other examples of document retrieval

jolie = people[people['name'] == 'Angelina Jolie']
knn_model.query(jolie)

reference_label	distance	rank
Angelina Jolie	0.0	1
Brad Pitt	0.784023668639	2
Julianne Moore	0.795857988166	3
Billy Bob Thornton	0.803069053708	4
George Clooney	0.8046875	5

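In the results of the Obama query, Obama is unsurprisingly closest to himself, and Joe Biden was his former deputy (former U.S. Vice President).

The closest match to Angelina Jolie is Brad Pitt (her husband), though I think the main reason is that both are Hollywood movie stars rather than the fact that they are married.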

Reference:

Machine Learning Foundations: A Case Study Approach, an open course from the University of Washington on Coursera and the main material these notes follow
Course slides
